Fault Tolerant Wide-Area Parallel Computing
Author
Abstract
Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and not compromise the high performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as a means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but wide-area replication outperforms checkpoint-recovery for larger-grain problems, precisely the problems most suited for the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.
Similar resources
Fault Tolerant Scheduling in Distributed Networks
We present a model for application-level fault tolerance for parallel applications. The objective is to achieve high reliability with minimal impact on the application. Our approach is based on a full replication of all parallel application components in a distributed wide-area environment in which each replica is independently scheduled in a different site. A system architecture for coordinati...
FIRE Ant with Activity Structures: a Biologically Motivated Approach to Fault-Tolerant Routing in Global Networks
This paper presents a new approach to fault-tolerant routing for large-scale distributed parallel computing that will allow communications to continue to take place in the presence of many faults, while maintaining low overhead on the system. The FIRE ANT algorithm uses its emergent behavior in order to make intelligent decisions (avoidance) when multiple failures exist between Autonomous Systems...
An approach to fault detection and correction in design of systems using of Turbo codes
We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes, in order to obtain flexibility in the design of error-correcting codes. Code-combining techniques are very effective, and turbo codes are one example of such codes. The algorithm-based fault tolerance techniques that detect errors rely on the c...
MPI/FT: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing
MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirem...
MPI/FT™: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing
MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements...